Climate change is a very big problem in contemporary times, and India, as the second most populous country in the world, contributes significantly to air pollution with many cities in the top 50 most polluted cities in the world. But the extent of pollution is not properly studied and used in decision making in India. Air quality monitoring stations are present in many cities but the data isn’t used for predictions in most cases, but rather as an observation. Another problem is that these AQI monitoring stations aren’t present in small cities. The project aims to build a forecasting model to predict AQI. The project focuses on the city of Indian cities like Delhi, Mumbai and Chennai which are infamous for their pollution levels.
The plots on the side reveal how developing countries (Low Income Mediterranean, South East Asia) contribute most to the particulate pollution. Moreover the other plots reveal how India compares to the rest of the world
LMIC: Low Middle Income Countries
LMIC: High Income Countries
The plots on the side show the amounts of pollutants and their spread across various cities in India. Particulate Matter is not shown here. The pollutants shown here make up significant portion of the Air Quality Index which is a widely used measure. We will be using the AQI as well in this project
By looking at the autocorrelation function (ACF) and partial autocorrelation (PACF) plots we can tentatively identify the numbers of AR and/or MA terms that are needed.ACF plot is merely a bar chart of the coefficients of correlation between a time series and lags of itself. The PACF plot is a plot of the partial correlation coefficients between the series and lags of itself.
We have used the algorithm proposed by Hyndman & Khandakarin 2008 which is implemented through auto.arima in R. This simplifies the process but increases time complexity. Since we have a relatively small data set we have gone ahead with this.
SARIMA is a Integrated AR, MA model fitted for daily/monthly values as well as an additional seasonal term. We have fitted the model using auto.arima.
The SARIMAX model has some regressors as well, to model the variable change with time. This model was fitted using auto.arima but the results were unfavorable due to missing values and compounding of errors.A Distributed Lag model was also checked but gave worse results.
The Ljung Box test of the residuals show if the residuals are white noise or if they have predictability. The squared residual plot checks conditional heteroscedasticity.
There are various error measures that can be used for model selection - AIC, BIC, MSE, MAE, MAPE, etc. We have used MAPE. We have chosen the most robust model as that is also an important criteria in selection.
The Chennai model has a lot of missing values and hence cannot be predicted properly, hence the apparent lack of seasonality. We drop Chennai from our analysis.
[1] "Delhi"
Series: AQID.ts
ARIMA(1,1,2)(0,1,0)[365]
Coefficients:
ar1 ma1 ma2
0.5423 -0.6921 -0.2647
s.e. 0.0347 0.0383 0.0334
sigma^2 estimated as 4716: log likelihood=-9282.91
AIC=18573.83 AICc=18573.85 BIC=18595.45
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.7725529 62.04998 41.97265 -4.096785 20.52469 0.5027082
ACF1
Training set -0.003184295
[1] "Mumbai"
Series: AQIM.ts
ARIMA(1,0,0)(0,1,0)[365] with drift
Coefficients:
ar1 drift
0.7150 -0.0262
s.e. 0.0339 0.0121
sigma^2 estimated as 675.2: log likelihood=-1972.82
AIC=3951.64 AICc=3951.7 BIC=3963.78
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.01404403 18.98299 9.353563 -1.96953 9.687569 0.3399684
ACF1
Training set 0.02590205
[1] "Chennai"
Series: AQIC.ts
ARIMA(4,1,3)
Coefficients:
ar1 ar2 ar3 ar4 ma1 ma2 ma3
1.5983 -0.8616 0.1788 -0.0069 -1.9292 1.162 -0.2259
s.e. 0.2212 NaN NaN 0.0138 0.2212 NaN NaN
sigma^2 estimated as 1228: log likelihood=-9579.86
AIC=19175.72 AICc=19175.8 BIC=19220.23
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -1.28413 34.9666 22.85284 -7.48814 20.17663 0.4631568 -0.001062613
These ARIMAX models are done using all the meteorological variables. This gives models with larger MAPE than the ARIMA model without an X variable/ matrix. This can be due to high deviations in the meteorological variables. A distributed lagged model was also fitted to check. ARIMA with no X variable is the best in any case.
[1] "Delhi"
Series: ydl
Regression with ARIMA(1,0,0) errors
Coefficients:
ar1 temp pressure humidity wind
0.8746 1.8201 0.0421 0.2311 0.3091
s.e. 0.0191 0.5682 0.0214 0.1363 0.3253
sigma^2 estimated as 426.9: log likelihood=-3498.19
AIC=7008.37 AICc=7008.48 BIC=7036.38
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.0194588 20.59614 13.4242 -3.261841 12.46601 1.011254 0.03280456
[1] "Mumbai"
Series: ydl
Regression with ARIMA(2,1,2) errors
Coefficients:
ar1 ar2 ma1 ma2 temp pressure humidity wind
1.0552 -0.4096 -1.2309 0.3260 -1.2183 0.3911 -0.2039 0.0950
s.e. 0.1460 0.0964 0.1531 0.1417 1.0391 0.7103 0.1454 0.4229
sigma^2 estimated as 390.3: log likelihood=-3456.66
AIC=6931.33 AICc=6931.56 BIC=6973.33
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.07747973 19.64383 12.64135 -2.087419 11.39747 0.9522809
ACF1
Training set -0.001610694
[1] "Chennai"
Series: ydl
Regression with ARIMA(1,1,1) errors
Coefficients:
ar1 ma1 temp pressure humidity wind
0.6278 -0.9694 -1.2233 -0.2736 0.1754 -0.0954
s.e. 0.0248 0.0098 1.3770 0.8662 0.2518 0.5131
sigma^2 estimated as 1406: log likelihood=-7557.43
AIC=15128.87 AICc=15128.94 BIC=15166.05
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -1.553327 37.40857 25.00667 -8.123409 21.2297 0.9668808
ACF1
Training set 0.0009152456
Box-Ljung test
data: ARIMAXD.out$residuals
X-squared = 6.5218, df = 10, p-value = 0.7697
Box-Ljung test
data: ARIMAXD.out$residuals^2
X-squared = 49.033, df = 1, p-value = 2.517e-12
Box-Ljung test
data: ARIMAXM.out$residuals
X-squared = 15.527, df = 10, p-value = 0.114
Box-Ljung test
data: ARIMAXM.out$residuals^2
X-squared = 125.9, df = 1, p-value < 2.2e-16
Box-Ljung test
data: ARIMAXC.out$residuals
X-squared = 14.726, df = 10, p-value = 0.1424
Box-Ljung test
data: ARIMAXC.out$residuals^2
X-squared = 142.62, df = 1, p-value < 2.2e-16
We can use the following forecasts to forecast future Air Quality in Delhi and Mumbai.The forecasts give AQI predictions for the next year. From our analysis we have identified the volatile nature of time series meteorological and sensor data and picked the most robust model to be the ARIMA model.
Our predictions can be used for making policy decisions such as the odd/ even rule in cities like Delhi. We can also appreciate that air quality level changes with seasons and hence should be managed accordingly.
The Ljung Box test confirmed conditional heteroscedasticity. GARCH models were fitted but the standard deviations were too high. The very nature of this data is highly volatile. Therefore we have obtained a reasonable prediction. Usage of a fourier curve for the seasonal component or modern techniques like LSTM Neural Nets might give better results.
# This code was used to scrape meterological data from the web
data<-NULL
for ( year in 2019:2020)
{
for(month in 1:12)
{
monthname<-c('january','february','march','april','may','june','july','august','september','october','november','december')
if (monthname[month] %in% c('january','march','may','july','august','october','december')){j=31}
else {j=30}
if(month==2&year%%4==0){j=29}
else if (month==2&year%%4!=0){j=28}
for(day in 1:j)
{
url<-read_html(paste0('https://en.tutiempo.net/records/vidp/',day,'-',monthname[month],'-',year,'.html'))
temp.raw<-url%>%
html_nodes('.Temp')%>%
html_text()
temp.raw<-gsub('[^\x20-\x7E]', '', temp.raw)
humid.raw<-url%>%
html_nodes('.hr')%>%
html_text()
humid.raw<-str_remove(humid.raw,'%')
wind.raw<-url%>%
html_nodes('.wind')%>%
html_text()
wind.raw<-str_remove(wind.raw,' km/h')
wind.raw<-na.omit(str_extract(wind.raw, '[[:digit:]]+'))
pressure.raw<-url%>%
html_nodes('.prob')%>%
html_text()
pressure.raw<-str_remove(pressure.raw,'hPa')
humidity<-mean(na.omit(as.integer(humid.raw)))
temp<-mean(na.omit(as.integer(temp.raw)))
pressure<-mean(na.omit(as.integer(pressure.raw)))
wind<-mean(na.omit(as.integer(wind.raw)))
Date<-as.Date(paste0(year,'-',month,'-',day))
datatemp<-data.frame(Date,temp,pressure,humidity,wind)
data<-rbind(data,datatemp)
}
}
nametemp<-paste('data',year,sep='_')
assign(nametemp,data)
}
data2<-rbind(data_2015,data_2016,data_2017,data_2018,data_2019,data_2020)
write.csv(data2,'data2.csv', row.names = FALSE)